StatsAccessLogger: fixes connection gauge underflow crashes when decrementing metrics after Scope evictions.#43812
Conversation
ggreenway
left a comment
There was a problem hiding this comment.
I don't think your fix is quite right.
I ran the integration test you added without your code changes, and it fails in an assertion ASSERT(used() || amount == 0); in sub(). I think either the assertion is no longer valid in the case of evicted stats, or the stat is being set to unused incorrectly.
if (scope->evictable_) {
MetricBag metrics(scope->scope_id_);
CentralCacheEntrySharedPtr& central_cache = scope->centralCacheMutableNoThreadAnalysis();
auto filter_unused = []<typename T>(StatNameHashMap<T>& unused_metrics) {
return [&unused_metrics](std::pair<StatName, T> kv) {
const auto& [name, metric] = kv;
if (metric->used()) {
metric->markUnused();
return false;
} else {
unused_metrics.try_emplace(name, metric);
return true;
}
};
};
The above code assumes that a stat is only ever held by a single scope (or other holder of a reference), which isn't correct. cc @kyessenov .
I think the use of std::min around all the sub() calls means that it's likely the counter could be incorrect. Even if this change prevents it from going negative, I think it is still an incorrect count.
/wait
When evicting unused stats from the central cache, we need to ensure that gauges actively referenced by components like AccessLogState are not evicted. The use_count() > 1 check prevents this, but a previous bug in evictUnused where the lambda parameter std::pair<StatName, T> kv was captured by value caused artificial inflation of the use_count due to the deep copy. This broke eviction entirely across the codebase. This commit fixes evictUnused by taking const auto& kv by reference, avoiding the deep copy and correctly applying the use_count() > 1 safeguard. Furthermore, AccessLogState now properly holds a GaugeSharedPtr in its State struct so its active references prevent premature eviction by evictUnused. The erroneous std::min safeguard during gauge subtractions is also removed as AccessLogState gauges will no longer be unfairly cleared. Signed-off-by: Xuyang Tao <taoxuy@google.com>
Updated with a interface to not evict per metric. We need to keep gauge not evicted in the scope as that it can be looked-up and then dec/inc on the same gauge. @kyessenov |
|
/retest |
|
Here's an idea for another approach: add a new method to a scope to add a stat to the scope by it's GaugeSharedPtr. Then in the destructor of the FilterState, you can just directly re-add the existing gauge into the scope, without needing it's name/tag components. |
CMIIW, if gauge is evictable, it cannot be Imagine when a gauge is incremented and then evicted before decremented, there is another |
The central store is the That's why holding a reference to the stat in the FilterState makes this work: it keeps the metric and it's current value from being removed from the
In this case, because the FilterState holds a reference, both would be using the same stat for inc/dec, so the value will not be corrupted. |
|
/retest |
ggreenway
left a comment
There was a problem hiding this comment.
I like this solution; it's much cleaner and clearer.
@kyessenov can you also review this, especially the change to eviction logic?
|
/retest |
ggreenway
left a comment
There was a problem hiding this comment.
Mostly looks good, just a couple small things
/wait
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Xuyang Tao <taoxuy@google.com>
| // We no longer check for the log line here because only 1 test variant will see the log line | ||
| // due to log throttling in periodic logs as IPv4 and IPv6 run in the same process. | ||
| // stats_test.cc has an EXPECT_LOG_CONTAINS for this situation already. |
|
You forgot the DCO signoff on the last commit. Please fix; you'll have to force-push |
Removed redundant log line check due to log throttling. Signed-off-by: Xuyang Tao <taoxuy@google.com>
fed339f to
babe799
Compare
Thx for the reminder. Done! |
|
/retest |
…ementing metrics after Scope evictions. (envoyproxy#43812) The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow. Signed-off-by: Xuyang Tao <taoxuy@google.com> Signed-off-by: Nick Shipilov <nick.shipilov.n@gmail.com>
Description: Fixes connection gauge underflow crashes in the Stats Access Logger when decrementing metrics after Scope evictions.
The original code correctly attempted to prevent "zombie" gauges by re-resolving metrics against the central store (via scope_->gaugeFromStatNameWithTags) during request destruction. However, it tried to reconstruct the gauge's identity using gauge_->tagExtractedStatName(). This failed because dynamic access-log tags (like %REQUEST_HEADER(...)%) are not registered with Envoy's global extractors. The extraction process returned a mangled base name and empty tags, forcing Scope to create a new 0-valued gauge. Subtracting 1 from it immediately crashed Envoy with a counter underflow.
Fix: we will keep the gauge in the scope cache if it is non-zero
Risk Level: Low
Testing: Added StatsAccessLogIntegrationTest.ActiveRequestsGaugeScopeEviction, which synthetically forces an asynchronous scope eviction while a connection is still inflight. Verified that the gauge successfully decrements to 0 in the central store identically to a normal request finish.
Docs: NA
Release: NA
Platform Specific Features: no